Importing the data
This course will give you some additional practice with importing and cleaning data through a series of four case studies.
You’ll be importing and cleaning four real datasets that are a little messier than before. Don’t worry – you’re up for the challenge!
Your first dataset describes online ticket sales for various events across the country. It’s stored as a Comma-Separated Value (CSV) file called sales.csv. Let’s jump right in!
# Import sales.csv: sales
sales <- read.csv("../xDatasets/sales.csv", stringsAsFactors = FALSE)Examining the data
As you know from the Cleaning Data in R course, the first step when preparing to clean data is to inspect it. Let’s refresh your memory on some useful functions that can do that:
dim()returns the dimensions of an object
head()displays the first part of an object
names()returns the names associated with an object
# View dimensions of sales
dim(sales)## [1] 5000 46
# Inspect first 6 rows of sales
sales %>%
head() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| X | event_id | primary_act_id | secondary_act_id | purch_party_lkup_id | event_name | primary_act_name | secondary_act_name | major_cat_name | minor_cat_name | la_event_type_cat | event_disp_name | ticket_text | tickets_purchased_qty | trans_face_val_amt | delivery_type_cd | event_date_time | event_dt | presale_dt | onsale_dt | sales_ord_create_dttm | sales_ord_tran_dt | print_dt | timezn_nm | venue_city | venue_state | venue_postal_cd_sgmt_1 | sales_platform_cd | print_flg | la_valid_tkt_event_flg | fin_mkt_nm | web_session_cookie_val | gndr_cd | age_yr | income_amt | edu_val | edu_1st_indv_val | edu_2nd_indv_val | adults_in_hh_num | married_ind | child_present_ind | home_owner_ind | occpn_val | occpn_1st_val | occpn_2nd_val | dist_to_ven |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | abcaf1adb99a935fc661 | 43f0436b905bfa7c2eec | b85143bf51323b72e53c | 7dfa56dd7d5956b17587 | Xfinity Center Mansfield Premier Parking: Florida Georgia Line | XFINITY Center Mansfield Premier Parking | NULL | MISC | PARKING | PARKING | Xfinity Center Mansfield Premier Parking: Florida Georgia Line | THIS TICKET IS VALID FOR PARKING ONLY GOOD THIS DAY ONLY PREMIER PARKING PASS XFINITY CENTER,LOTS 4 PM SAT SEP 12 2015 7:30 PM | 1 | 45 | eTicket | 2015-09-12 23:30:00 | 2015-09-12 | NULL | 2015-05-15 | 2015-09-11 18:17:45 | 2015-09-11 | 2015-09-12 | EST | MANSFIELD | MASSACHUSETTS | 02048 | www.concerts.livenation.com | T | N | Boston | 7dfa56dd7d5956b17587 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2 | 6c56d7f08c95f2aa453c | 1a3e9aecd0617706a794 | f53529c5679ea6ca5a48 | 4f9e6fc637eaf7b736c2 | Gorge Camping - dave matthews band - sept 3-7 | Gorge Camping | Dave Matthews Band | MISC | CAMPING | INVALID | Gorge Camping - dave matthews band - sept 3-7 | %OVERNIGHT C A M P I N G%* * * * * %GORGE CAMPGROUND% GOOD THIS DATE ONLY *%SEP 3 - 6, 2009 | 1 | 75 | TicketFast | 2009-09-05 01:00:00 | 2009-09-04 | NULL | 2009-03-13 | 2009-07-06 00:00:00 | 2009-07-05 | 2009-09-01 | PST | QUINCY | WASHINGTON | 98848 | NULL | T | N | Seattle | 4f9e6fc637eaf7b736c2 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 59 |
| 3 | c7ab4524a121f9d687d2 | 4b677c3f5bec71eec8d1 | b85143bf51323b72e53c | 6c2545703bd527a7144d | Dodge Theatre Adams Street Parking - benise | Parking Event | NULL | MISC | PARKING | PARKING | Dodge Theatre Adams Street Parking - benise | ADAMS STREET GARAGE%PARKING FOR 4/21/06 ONLY%DODGE THEATRE PARKING PASS%ENTRANCE ON ADAMS STREET%BENISE%GARAGE OPENS AT 6:00PM | 1 | 5 | TicketFast | 2006-04-22 01:30:00 | 2006-04-21 | NULL | 2006-02-25 | 2006-04-05 00:00:00 | 2006-04-05 | 2006-04-05 | MST | PHOENIX | ARIZONA | 85003 | NULL | T | N | Arizona | 6c2545703bd527a7144d | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 4 | 394cb493f893be9b9ed1 | b1ccea01ad6ef8522796 | b85143bf51323b72e53c | 527d6b1eaffc69ddd882 | Gexa Energy Pavilion Vip Parking : kid rock with sheryl crow | Gexa Energy Pavilion VIP Parking | NULL | MISC | PARKING | PARKING | Gexa Energy Pavilion Vip Parking : kid rock with sheryl crow | THIS TICKET IS VALID FOR PARKING ONLY GOOD FOR THIS DATE ONLY VIP PARKING PASS GEXA ENERGY PAVILION FRI SEP 02 2011 7:00 PM | 1 | 20 | 2011-09-03 00:00:00 | 2011-09-02 | NULL | 2011-04-22 | 2011-07-01 17:38:50 | 2011-07-01 | 2011-07-06 | CST | DALLAS | TEXAS | 75210 | NULL | T | N | Dallas | 527d6b1eaffc69ddd882 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | |
| 5 | 55b5f67e618557929f48 | 91c03a34b562436efa3c | b85143bf51323b72e53c | 8bd62c394a35213bdf52 | Premier Parking - motley crue | White River Amphitheatre Premier Parking | NULL | MISC | PARKING | PARKING | Premier Parking - motley crue | THIS TICKET IS VALID%FOR PARKING ONLY%GOOD THIS DATE ONLY%PREMIER PARKING PASS%WHITE RIVER AMPHITHEATRE%SAT JUL 30, 2005 6:00PM | 1 | 20 | 2005-07-31 01:00:00 | 2005-07-30 | 2005-03-02 | 2005-03-04 | 2005-06-18 00:00:00 | 2005-06-18 | 2005-06-28 | PST | AUBURN | WASHINGTON | 98092 | NULL | T | N | Seattle | 8bd62c394a35213bdf52 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | |
| 6 | 4f10fd8b9f550352bd56 | ac4b847b3fde66f2117e | 63814f3d63317f1b56c4 | 3b3a628f83135acd0676 | Fast Lane Access: Journey | Fast Lane Access | Journey | MISC | SPECIAL ENTRY (UPSELL) | UPSELL | Fast Lane Access: Journey | FAST LANE JOURNEY FAST LANE EVENT THIS IS NOT A TICKET SAN MANUEL AMPHITHEATER SAT JUL 21 2012 7:00 PM | 2 | 10 | TicketFast | 2012-07-22 02:00:00 | 2012-07-21 | NULL | 2012-04-11 | 2012-07-21 17:20:18 | 2012-07-21 | 2012-07-21 | PST | SAN BERNARDINO | CALIFORNIA | 92407 | www.livenation.com | T | N | Los Angeles | 3b3a628f83135acd0676 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
# View column names of sales
names(sales)## [1] "X" "event_id"
## [3] "primary_act_id" "secondary_act_id"
## [5] "purch_party_lkup_id" "event_name"
## [7] "primary_act_name" "secondary_act_name"
## [9] "major_cat_name" "minor_cat_name"
## [11] "la_event_type_cat" "event_disp_name"
## [13] "ticket_text" "tickets_purchased_qty"
## [15] "trans_face_val_amt" "delivery_type_cd"
## [17] "event_date_time" "event_dt"
## [19] "presale_dt" "onsale_dt"
## [21] "sales_ord_create_dttm" "sales_ord_tran_dt"
## [23] "print_dt" "timezn_nm"
## [25] "venue_city" "venue_state"
## [27] "venue_postal_cd_sgmt_1" "sales_platform_cd"
## [29] "print_flg" "la_valid_tkt_event_flg"
## [31] "fin_mkt_nm" "web_session_cookie_val"
## [33] "gndr_cd" "age_yr"
## [35] "income_amt" "edu_val"
## [37] "edu_1st_indv_val" "edu_2nd_indv_val"
## [39] "adults_in_hh_num" "married_ind"
## [41] "child_present_ind" "home_owner_ind"
## [43] "occpn_val" "occpn_1st_val"
## [45] "occpn_2nd_val" "dist_to_ven"
Notice how the rows appear to represent individual purchases and the columns contain different pieces of information about each purchase.
Summarizing the data
Luckily, the rows and columns appear to be arranged in a meaningful way: each row represents an observation and each column a variable, or piece of information about that observation.
In R, there are a great many tools at your disposal to help get a feel for your data. Besides the three you used in the previous exercise, the functions str() and summary() can be very helpful.
The dplyr package, introduced in Cleaning Data in R, offers the glimpse() function, which can also be used for this purpose.
# Look at structure of sales
str(sales, give.attr = FALSE)## 'data.frame': 5000 obs. of 46 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ event_id : chr "abcaf1adb99a935fc661" "6c56d7f08c95f2aa453c" "c7ab4524a121f9d687d2" "394cb493f893be9b9ed1" ...
## $ primary_act_id : chr "43f0436b905bfa7c2eec" "1a3e9aecd0617706a794" "4b677c3f5bec71eec8d1" "b1ccea01ad6ef8522796" ...
## $ secondary_act_id : chr "b85143bf51323b72e53c" "f53529c5679ea6ca5a48" "b85143bf51323b72e53c" "b85143bf51323b72e53c" ...
## $ purch_party_lkup_id : chr "7dfa56dd7d5956b17587" "4f9e6fc637eaf7b736c2" "6c2545703bd527a7144d" "527d6b1eaffc69ddd882" ...
## $ event_name : chr "Xfinity Center Mansfield Premier Parking: Florida Georgia Line" "Gorge Camping - dave matthews band - sept 3-7" "Dodge Theatre Adams Street Parking - benise" "Gexa Energy Pavilion Vip Parking : kid rock with sheryl crow" ...
## $ primary_act_name : chr "XFINITY Center Mansfield Premier Parking" "Gorge Camping" "Parking Event" "Gexa Energy Pavilion VIP Parking" ...
## $ secondary_act_name : chr "NULL" "Dave Matthews Band" "NULL" "NULL" ...
## $ major_cat_name : chr "MISC" "MISC" "MISC" "MISC" ...
## $ minor_cat_name : chr "PARKING" "CAMPING" "PARKING" "PARKING" ...
## $ la_event_type_cat : chr "PARKING" "INVALID" "PARKING" "PARKING" ...
## $ event_disp_name : chr "Xfinity Center Mansfield Premier Parking: Florida Georgia Line" "Gorge Camping - dave matthews band - sept 3-7" "Dodge Theatre Adams Street Parking - benise" "Gexa Energy Pavilion Vip Parking : kid rock with sheryl crow" ...
## $ ticket_text : chr " THIS TICKET IS VALID FOR PARKING ONLY GOOD THIS DAY ONLY PREMIER PARKING PASS XFINIT"| __truncated__ "%OVERNIGHT C A M P I N G%* * * * * *%GORGE CAMPGROUND%* GOOD THIS DATE ONLY *%SEP 3 - 6, 2009" "ADAMS STREET GARAGE%PARKING FOR 4/21/06 ONLY%DODGE THEATRE PARKING PASS%ENTRANCE ON ADAMS STREET%BENISE%GARAGE OPENS AT 6:00PM" " THIS TICKET IS VALID FOR PARKING ONLY GOOD FOR THIS DATE ONLY VIP PARKING PASS GEXA"| __truncated__ ...
## $ tickets_purchased_qty : int 1 1 1 1 1 2 1 1 1 1 ...
## $ trans_face_val_amt : num 45 75 5 20 20 10 30 28 20 25 ...
## $ delivery_type_cd : chr "eTicket" "TicketFast" "TicketFast" "Mail" ...
## $ event_date_time : chr "2015-09-12 23:30:00" "2009-09-05 01:00:00" "2006-04-22 01:30:00" "2011-09-03 00:00:00" ...
## $ event_dt : chr "2015-09-12" "2009-09-04" "2006-04-21" "2011-09-02" ...
## $ presale_dt : chr "NULL" "NULL" "NULL" "NULL" ...
## $ onsale_dt : chr "2015-05-15" "2009-03-13" "2006-02-25" "2011-04-22" ...
## $ sales_ord_create_dttm : chr "2015-09-11 18:17:45" "2009-07-06 00:00:00" "2006-04-05 00:00:00" "2011-07-01 17:38:50" ...
## $ sales_ord_tran_dt : chr "2015-09-11" "2009-07-05" "2006-04-05" "2011-07-01" ...
## $ print_dt : chr "2015-09-12" "2009-09-01" "2006-04-05" "2011-07-06" ...
## $ timezn_nm : chr "EST" "PST" "MST" "CST" ...
## $ venue_city : chr "MANSFIELD" "QUINCY" "PHOENIX" "DALLAS" ...
## $ venue_state : chr "MASSACHUSETTS" "WASHINGTON" "ARIZONA" "TEXAS" ...
## $ venue_postal_cd_sgmt_1: chr "02048" "98848" "85003" "75210" ...
## $ sales_platform_cd : chr "www.concerts.livenation.com" "NULL" "NULL" "NULL" ...
## $ print_flg : chr "T " "T " "T " "T " ...
## $ la_valid_tkt_event_flg: chr "N " "N " "N " "N " ...
## $ fin_mkt_nm : chr "Boston" "Seattle" "Arizona" "Dallas" ...
## $ web_session_cookie_val: chr "7dfa56dd7d5956b17587" "4f9e6fc637eaf7b736c2" "6c2545703bd527a7144d" "527d6b1eaffc69ddd882" ...
## $ gndr_cd : chr NA NA NA NA ...
## $ age_yr : chr NA NA NA NA ...
## $ income_amt : chr NA NA NA NA ...
## $ edu_val : chr NA NA NA NA ...
## $ edu_1st_indv_val : chr NA NA NA NA ...
## $ edu_2nd_indv_val : chr NA NA NA NA ...
## $ adults_in_hh_num : chr NA NA NA NA ...
## $ married_ind : chr NA NA NA NA ...
## $ child_present_ind : chr NA NA NA NA ...
## $ home_owner_ind : chr NA NA NA NA ...
## $ occpn_val : chr NA NA NA NA ...
## $ occpn_1st_val : chr NA NA NA NA ...
## $ occpn_2nd_val : chr NA NA NA NA ...
## $ dist_to_ven : int NA 59 NA NA NA NA NA NA NA NA ...
# View a summary of sales
sum_sales <- as.data.frame(do.call(cbind, lapply(sales, summary)))## Warning in (function (..., deparse.level = 1) : number of rows of result is
## not a multiple of vector length (arg 1)
sum_sales[,-1] %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| event_id | primary_act_id | secondary_act_id | purch_party_lkup_id | event_name | primary_act_name | secondary_act_name | major_cat_name | minor_cat_name | la_event_type_cat | event_disp_name | ticket_text | tickets_purchased_qty | trans_face_val_amt | delivery_type_cd | event_date_time | event_dt | presale_dt | onsale_dt | sales_ord_create_dttm | sales_ord_tran_dt | print_dt | timezn_nm | venue_city | venue_state | venue_postal_cd_sgmt_1 | sales_platform_cd | print_flg | la_valid_tkt_event_flg | fin_mkt_nm | web_session_cookie_val | gndr_cd | age_yr | income_amt | edu_val | edu_1st_indv_val | edu_2nd_indv_val | adults_in_hh_num | married_ind | child_present_ind | home_owner_ind | occpn_val | occpn_1st_val | occpn_2nd_val | dist_to_ven | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 1 | 1 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 0 |
| 1st Qu. | character | character | character | character | character | character | character | character | character | character | character | character | 1 | 20 | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | 12 |
| Median | character | character | character | character | character | character | character | character | character | character | character | character | 1 | 30 | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | 26 |
| Mean | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 1.639 | 77.077446 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 158.188854489164 |
| 3rd Qu. | character | character | character | character | character | character | character | character | character | character | character | character | 2 | 85 | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | 77.5 |
| Max. | character | character | character | character | character | character | character | character | character | character | character | character | 8 | 1520.88 | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | character | 2548 |
| NA’s | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 1 | 1 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 4677 |
# Load dplyr
library(dplyr)
# Get a glimpse of sales
glimpse(sales)## Observations: 5,000
## Variables: 46
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...
## $ event_id <chr> "abcaf1adb99a935fc661", "6c56d7f08c95f2...
## $ primary_act_id <chr> "43f0436b905bfa7c2eec", "1a3e9aecd06177...
## $ secondary_act_id <chr> "b85143bf51323b72e53c", "f53529c5679ea6...
## $ purch_party_lkup_id <chr> "7dfa56dd7d5956b17587", "4f9e6fc637eaf7...
## $ event_name <chr> "Xfinity Center Mansfield Premier Parki...
## $ primary_act_name <chr> "XFINITY Center Mansfield Premier Parki...
## $ secondary_act_name <chr> "NULL", "Dave Matthews Band", "NULL", "...
## $ major_cat_name <chr> "MISC", "MISC", "MISC", "MISC", "MISC",...
## $ minor_cat_name <chr> "PARKING", "CAMPING", "PARKING", "PARKI...
## $ la_event_type_cat <chr> "PARKING", "INVALID", "PARKING", "PARKI...
## $ event_disp_name <chr> "Xfinity Center Mansfield Premier Parki...
## $ ticket_text <chr> " THIS TICKET IS VALID FOR PAR...
## $ tickets_purchased_qty <int> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 4, ...
## $ trans_face_val_amt <dbl> 45, 75, 5, 20, 20, 10, 30, 28, 20, 25, ...
## $ delivery_type_cd <chr> "eTicket", "TicketFast", "TicketFast", ...
## $ event_date_time <chr> "2015-09-12 23:30:00", "2009-09-05 01:0...
## $ event_dt <chr> "2015-09-12", "2009-09-04", "2006-04-21...
## $ presale_dt <chr> "NULL", "NULL", "NULL", "NULL", "2005-0...
## $ onsale_dt <chr> "2015-05-15", "2009-03-13", "2006-02-25...
## $ sales_ord_create_dttm <chr> "2015-09-11 18:17:45", "2009-07-06 00:0...
## $ sales_ord_tran_dt <chr> "2015-09-11", "2009-07-05", "2006-04-05...
## $ print_dt <chr> "2015-09-12", "2009-09-01", "2006-04-05...
## $ timezn_nm <chr> "EST", "PST", "MST", "CST", "PST", "PST...
## $ venue_city <chr> "MANSFIELD", "QUINCY", "PHOENIX", "DALL...
## $ venue_state <chr> "MASSACHUSETTS", "WASHINGTON", "ARIZONA...
## $ venue_postal_cd_sgmt_1 <chr> "02048", "98848", "85003", "75210", "98...
## $ sales_platform_cd <chr> "www.concerts.livenation.com", "NULL", ...
## $ print_flg <chr> "T ", "T ", "T ", "T ", "T ", "T ", "T ...
## $ la_valid_tkt_event_flg <chr> "N ", "N ", "N ", "N ", "N ", "N ", "N ...
## $ fin_mkt_nm <chr> "Boston", "Seattle", "Arizona", "Dallas...
## $ web_session_cookie_val <chr> "7dfa56dd7d5956b17587", "4f9e6fc637eaf7...
## $ gndr_cd <chr> NA, NA, NA, NA, NA, NA, "M", NA, NA, NA...
## $ age_yr <chr> NA, NA, NA, NA, NA, NA, "28", NA, NA, N...
## $ income_amt <chr> NA, NA, NA, NA, NA, NA, "112500", NA, N...
## $ edu_val <chr> NA, NA, NA, NA, NA, NA, "High School", ...
## $ edu_1st_indv_val <chr> NA, NA, NA, NA, NA, NA, "High School", ...
## $ edu_2nd_indv_val <chr> NA, NA, NA, NA, NA, NA, "NULL", NA, NA,...
## $ adults_in_hh_num <chr> NA, NA, NA, NA, NA, NA, "4", NA, NA, NA...
## $ married_ind <chr> NA, NA, NA, NA, NA, NA, "0", NA, NA, NA...
## $ child_present_ind <chr> NA, NA, NA, NA, NA, NA, "1", NA, NA, NA...
## $ home_owner_ind <chr> NA, NA, NA, NA, NA, NA, "0", NA, NA, NA...
## $ occpn_val <chr> NA, NA, NA, NA, NA, NA, "NULL", NA, NA,...
## $ occpn_1st_val <chr> NA, NA, NA, NA, NA, NA, "Craftsman Blue...
## $ occpn_2nd_val <chr> NA, NA, NA, NA, NA, NA, "NULL", NA, NA,...
## $ dist_to_ven <int> NA, 59, NA, NA, NA, NA, NA, NA, NA, NA,...
Before moving on, scroll to the top of the glimpse() output. Notice the first column, X, which appears to just be counting.
Removing redundant info
You may have noticed that the first column of data is just a duplication of the row numbers. Not very useful. Go ahead and delete that column.
Remember that nrow() and ncol() return the number of rows and columns in a data frame, respectively.
Also, recall that you can use square brackets to subset a data frame as follows:
my_df[1:5, ]# First 5 rows of my_df
my_df[, 4]# Fourth column of my_df
Alternatively, you can remove rows and columns using negative indices. For example:
my_df[-(1:5), ]# Omit first 5 rows of my_df
my_df[, -4]# Omit fourth column of my_df
# Remove the first column of sales: sales2
sales2 <- sales[, -1]Information not worth keeping
Many of the columns have information that’s of no use to us. For example, the first four columns contain internal codes representing particular events. The last fifteen columns also aren’t worth keeping; there are too many missing values to make them worthwhile.
An easy way to get rid of unnecessary columns is to create a vector containing the column indices you want to keep, then subset the data based on that vector using single bracket subsetting.
# Define a vector of column indices: keep
keep <- 5:(ncol(sales2) - 15)
# Subset sales2 using keep: sales3
sales3 <- sales2[, keep]Separating columns
Some of the columns in your data frame include multiple pieces of information that should be in separate columns. In this exercise, you will separate such a column into two: one for date and one for time. You will use the separate() function from the tidyr package.
Take a look at the event_date_time column by typing head(sales3$event_date_time) in the console. You’ll notice that the date and time are separated by a space. Therefore, you’ll use sep = " " as an argument to separate().
# Load tidyr
library(tidyr)
# Split event_date_time: sales4
sales4 <- separate(sales3, event_date_time,
c("event_dt", "event_time"), sep = " ")
# Split sales_ord_create_dttm: sales5
sales5 <- separate(sales4, sales_ord_create_dttm,
c("ord_create_dt", "ord_create_time"), sep = " ")## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 4 rows
## [2516, 3863, 4082, 4183].
Did you see the warning message that just popped up in the console? No need to panic (yet). You’ll sort it out in the next exercise.
Dealing with warnings
Looks like that second call to separate() threw a warning. Not to worry; warnings aren’t as bad as error messages. It’s not saying that the command didn’t execute; it’s just a heads-up that something unusual happened.
The warning says Too few values at 4 locations. You may be able to guess already what the issue is, but it’s still good to take a look.
The locations (i.e. rows) given in the warning are 2516, 3863, 4082, and 4183. Have a look at the contents of the sales_ord_create_dttm column in those rows.
# Define an issues vector
issues <- c(2516, 3863, 4082, 4183)
# Print values of sales_ord_create_dttm at these indices
sales3$sales_ord_create_dttm[issues]## [1] "NULL" "NULL" "NULL" "NULL"
# Print a well-behaved value of sales_ord_create_dttm
sales3$sales_ord_create_dttm[2517]## [1] "2013-08-04 23:07:19"
Thee warning was just because of four missing values. You’ll ignore them for now, but if your analysis depended on complete date/time information, you would probably need to delete those rows.
Identifying dates
Some of the columns in your dataset contain dates of different events. Right now, they are stored as character strings. That’s fine if all you want to do is look up the date associated with an event, but if you want to do any comparisons or math with the dates, it’s MUCH easier to store them as Date objects.
Luckily, all of the date columns in this dataset have the substring “dt” in their name, so you can use the str_detect() function of the stringr package to find the date columns. Then you can coerce them to Date objects using a function from the lubridate package.
You’ll use lapply() to apply the appropriate lubridate function to all of the columns that contain dates. Recall the following syntax for lapply() applied to some data frame columns of interest:
lapply(my_data_frame[, cols], function_name)
Also recall that function names in lubridate combine the letters y, m, d, h, m, and s depending on the format of the date/time string being read in.
# Load stringr
library(stringr)
# Find columns of sales5 containing "dt": date_cols
date_cols <- str_detect(names(sales5), "dt")
# Load lubridate
library(lubridate)##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
# Coerce date columns into Date objects
sales5[, date_cols] <- lapply(sales5[, date_cols], ymd)## Warning: 2892 failed to parse.
## Warning: 101 failed to parse.
## Warning: 4 failed to parse.
## Warning: 424 failed to parse.
there were a few more warnings… Sigh.
More warnings!
As you saw, some of the calls to ymd() caused a failure to parse warning. That’s probably because of more missing data, but again, it’s good to check to be sure.
The first two lines of code (provided for you here) create a list of logical vectors called missing. Each vector in the list indicates the presence (or absence) of missing values in the corresponding column of sales5. See if the number of missing values in each column is the same as the number of rows that failed to parse in the previous exercise.
As a reminder, here are the warning messages:
Warning message: 2892 failed to parse.
Warning message: 101 failed to parse.
Warning message: 4 failed to parse.
Warning message: 424 failed to parse.
# Find date columns (don't change)
date_cols <- str_detect(names(sales5), "dt")
# Create logical vectors indicating missing values (don't change)
missing <- lapply(sales5[, date_cols], is.na)
# Create a numerical vector that counts missing values: num_missing
num_missing <- lapply(missing, sum)
# Print num_missing
num_missing## $event_dt
## [1] 0
##
## $presale_dt
## [1] 2892
##
## $onsale_dt
## [1] 101
##
## $ord_create_dt
## [1] 4
##
## $sales_ord_tran_dt
## [1] 0
##
## $print_dt
## [1] 424
Yep, it was missing data again. Ah, the joys of working with real-life datasets!
Combining columns
Sure enough, the number of NAs in each column match the numbers from the warning messages, so missing data is the culprit. How to proceed depends on your desired analysis. If you really need complete sets of date/time information, you might delete the rows or columns containing NAs.
As your last step, you’ll use the tidyr function unite() to combine the venue_city and venue_state columns into one column with the two values separated by a comma and a space. For example, “PORTLAND” “MAINE” should become “PORTLAND, MAINE”.
# Combine the venue_city and venue_state columns
sales6 <- unite(sales5, "venue_city_state", c("venue_city", "venue_state"), sep = ", ")
# View the head of sales6
sales6 %>%
head() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689") %>%
scroll_box(width = "100%", height = "500px")| event_name | primary_act_name | secondary_act_name | major_cat_name | minor_cat_name | la_event_type_cat | event_disp_name | ticket_text | tickets_purchased_qty | trans_face_val_amt | delivery_type_cd | event_dt | event_time | presale_dt | onsale_dt | ord_create_dt | ord_create_time | sales_ord_tran_dt | print_dt | timezn_nm | venue_city_state | venue_postal_cd_sgmt_1 | sales_platform_cd | print_flg | la_valid_tkt_event_flg | fin_mkt_nm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Xfinity Center Mansfield Premier Parking: Florida Georgia Line | XFINITY Center Mansfield Premier Parking | NULL | MISC | PARKING | PARKING | Xfinity Center Mansfield Premier Parking: Florida Georgia Line | THIS TICKET IS VALID FOR PARKING ONLY GOOD THIS DAY ONLY PREMIER PARKING PASS XFINITY CENTER,LOTS 4 PM SAT SEP 12 2015 7:30 PM | 1 | 45 | eTicket | 2015-09-12 | 23:30:00 | NA | 2015-05-15 | 2015-09-11 | 18:17:45 | 2015-09-11 | 2015-09-12 | EST | MANSFIELD, MASSACHUSETTS | 02048 | www.concerts.livenation.com | T | N | Boston |
| Gorge Camping - dave matthews band - sept 3-7 | Gorge Camping | Dave Matthews Band | MISC | CAMPING | INVALID | Gorge Camping - dave matthews band - sept 3-7 | %OVERNIGHT C A M P I N G%* * * * * %GORGE CAMPGROUND% GOOD THIS DATE ONLY *%SEP 3 - 6, 2009 | 1 | 75 | TicketFast | 2009-09-05 | 01:00:00 | NA | 2009-03-13 | 2009-07-06 | 00:00:00 | 2009-07-05 | 2009-09-01 | PST | QUINCY, WASHINGTON | 98848 | NULL | T | N | Seattle |
| Dodge Theatre Adams Street Parking - benise | Parking Event | NULL | MISC | PARKING | PARKING | Dodge Theatre Adams Street Parking - benise | ADAMS STREET GARAGE%PARKING FOR 4/21/06 ONLY%DODGE THEATRE PARKING PASS%ENTRANCE ON ADAMS STREET%BENISE%GARAGE OPENS AT 6:00PM | 1 | 5 | TicketFast | 2006-04-22 | 01:30:00 | NA | 2006-02-25 | 2006-04-05 | 00:00:00 | 2006-04-05 | 2006-04-05 | MST | PHOENIX, ARIZONA | 85003 | NULL | T | N | Arizona |
| Gexa Energy Pavilion Vip Parking : kid rock with sheryl crow | Gexa Energy Pavilion VIP Parking | NULL | MISC | PARKING | PARKING | Gexa Energy Pavilion Vip Parking : kid rock with sheryl crow | THIS TICKET IS VALID FOR PARKING ONLY GOOD FOR THIS DATE ONLY VIP PARKING PASS GEXA ENERGY PAVILION FRI SEP 02 2011 7:00 PM | 1 | 20 | 2011-09-03 | 00:00:00 | NA | 2011-04-22 | 2011-07-01 | 17:38:50 | 2011-07-01 | 2011-07-06 | CST | DALLAS, TEXAS | 75210 | NULL | T | N | Dallas | |
| Premier Parking - motley crue | White River Amphitheatre Premier Parking | NULL | MISC | PARKING | PARKING | Premier Parking - motley crue | THIS TICKET IS VALID%FOR PARKING ONLY%GOOD THIS DATE ONLY%PREMIER PARKING PASS%WHITE RIVER AMPHITHEATRE%SAT JUL 30, 2005 6:00PM | 1 | 20 | 2005-07-31 | 01:00:00 | 2005-03-02 | 2005-03-04 | 2005-06-18 | 00:00:00 | 2005-06-18 | 2005-06-28 | PST | AUBURN, WASHINGTON | 98092 | NULL | T | N | Seattle | |
| Fast Lane Access: Journey | Fast Lane Access | Journey | MISC | SPECIAL ENTRY (UPSELL) | UPSELL | Fast Lane Access: Journey | FAST LANE JOURNEY FAST LANE EVENT THIS IS NOT A TICKET SAN MANUEL AMPHITHEATER SAT JUL 21 2012 7:00 PM | 2 | 10 | TicketFast | 2012-07-22 | 02:00:00 | NA | 2012-04-11 | 2012-07-21 | 17:20:18 | 2012-07-21 | 2012-07-21 | PST | SAN BERNARDINO, CALIFORNIA | 92407 | www.livenation.com | T | N | Los Angeles |
This dataset is much cleaner. Your next steps would depend on what specific analyses you wanted to perform; for now, we’ll call it a chapter. Next up, you’ll look at some data about “the T”, Boston’s public transit system.
# Save our tidy dataframe to csv file
write.csv(sales6,'../xDatasets/sales_clean.csv')